21 research outputs found

    Translating English Discourse Connectives into Arabic: a Corpus-based Analysis and an Evaluation Metric

    Discourse connectives can often signal multiple discourse relations, depending on their context. The automatic identification of the Arabic translations of seven English discourse connectives shows how these connectives are translated differently depending on their actual senses. Automatic labelling of English source connectives can help a machine translation system to translate them more correctly. The corpus-based analysis of Arabic translations also enables the definition of a connective-specific evaluation metric for machine translation, which is validated here by human judges on sample English/Arabic translation data.

    DCEP - Digital Corpus of the European Parliament

    The paper presents a new highly multilingual sentence-aligned parallel corpus consisting of various document types and covering a wide range of subject domains. With a total of 1.37 billion words in 23 languages (253 language pairs), gathered in the course of ten years, this is the largest single release of documents by a European Union institution. Corpus statistics, required preprocessing, sentence alignment, and possible gains in statistical machine translation when adding this corpus to the previously existing ones are also considered. JRC.G.2-Global security and crisis managemen
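
    The figure of 253 language pairs follows from counting unordered pairs among 23 languages: 23 × 22 / 2 = 253. A minimal sketch of that arithmetic follows; the language codes are placeholders, since the abstract does not list the languages, and the function name is introduced here for illustration only.

```python
from itertools import combinations


def count_language_pairs(n_languages: int) -> int:
    """Return the number of unordered pairs among n_languages languages."""
    return n_languages * (n_languages - 1) // 2


# Placeholder codes (assumption): the abstract does not name the 23 languages.
languages = [f"lang{i:02d}" for i in range(23)]

# Enumerate every unordered pair explicitly to cross-check the closed form.
pairs = list(combinations(languages, 2))

print(count_language_pairs(23))  # 253
print(len(pairs))                # 253
```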

    Disambiguating Discourse Connectives for Statistical Machine Translation

    This paper shows that the automatic labeling of discourse connectives with the relations they signal, prior to machine translation (MT), can be used by phrase-based statistical MT systems to improve their translations. This improvement is demonstrated here when translating from English to four target languages - French, German, Italian and Arabic - using several test sets from recent MT evaluation campaigns. Using automatically labeled data for training, tuning and testing MT systems is beneficial on condition that labels are sufficiently accurate, typically above 70%. To reach such an accuracy, a large array of features for discourse connective labeling (morpho-syntactic, semantic and discursive) are extracted using state-of-the-art tools and exploited in factored MT models. The translation of connectives is improved significantly, between 0.7% and 10% as measured with the dedicated ACT metric. The improvements depend mainly on the level of ambiguity of the connectives in the test sets
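
    The idea of labelling connectives before translation can be sketched as a preprocessing step that attaches a sense label to each connective token as an extra factor. This is only an illustration under stated assumptions: the connective list, the sense labels, the `token|label` output format and the toy rule-based classifier are all hypothetical and are not taken from the paper, which uses a trained classifier over a large feature set.

```python
# Hypothetical connective inventory and sense labels, for illustration only.
CONNECTIVES = {"since", "while", "although"}


def toy_classifier(tokens, i):
    """Toy stand-in for a trained labeller (assumption, not the paper's method)."""
    next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
    if tokens[i].lower() == "since":
        # "since" followed by a number (e.g. a year) is treated as temporal.
        return "Temporal" if next_token.isdigit() else "Causal"
    return "Contrast"


def label_connectives(tokens, classify):
    """Attach a sense label to each connective; other tokens get 'null'."""
    labelled = []
    for i, token in enumerate(tokens):
        if token.lower() in CONNECTIVES:
            labelled.append(f"{token}|{classify(tokens, i)}")
        else:
            labelled.append(f"{token}|null")
    return labelled


sentence = "He has lived here since 2010".split()
print(" ".join(label_connectives(sentence, toy_classifier)))
# He|null has|null lived|null here|null since|Temporal 2010|null
```

    Passing the classifier in as a parameter keeps the labelling step independent of how the labels are produced, so the toy rule could be swapped for a real model without changing the preprocessing.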

    Automatic Alignment of Multilingual Resources in the Linguistic Linked Open Data Cloud

    The creation of Europe’s Digital Single Market requires interoperable multilingual resources in the Linguistic Linked Open Data (LLOD) cloud. The PMKI project aims to create a public multilingual knowledge management infrastructure, able to establish and manage interoperability between multilingual classification systems (like thesauri) and other language resources. This paper presents the standards used by PMKI and a methodology for automatic mapping between multilingual resources, based on an information retrieval framework.

    Recherche et production de corpus de messages pour la multilinguisation de sites de e-commerce en SMS, initialement en arabe

    In this paper, we present our research within the CATS project (Classified Ads through SMS) [3]. CATS is a system for managing small Arabic classified ads for buying and selling (cars, real estate...) posted by SMS, currently deployed in Jordan by the operator FastLink. To adapt this system to other languages (French, English) and to extend it to other sectors (employment, marriage, household appliances, mobile phone sales, yellow pages...), we face the difficulty of finding or building SMS corpora that are functionally equivalent to a real, natural Arabic corpus. Does a simple translation of this starting corpus yield a corpus of the same type (real, natural)? We present an answer to this question and a solution for one case of multilinguization of SMS-based e-commerce sites, initially in Arabic.

    Multilinguïsation des systèmes de e-commerce traitant des énoncés spontanés en langue naturelle

    We are interested in the multilinguization, or “linguistic porting” (simpler than localization), of content management services that process spontaneous natural-language utterances, which are often noisy but constrained by the situation and constitute a restricted “sublanguage”. Any service of this type (App) uses a specific content representation (CR-App) on which the functional kernel operates. Most often, this representation is produced from the “native” language L1 by a content extractor (CE-App). We identified three possible porting methods and illustrated them by porting to French a part of CATS, a system handling small ads in SMS (in Arabic) deployed in Amman, as well as IMRS, a music retrieval system whose native natural-language interface is in Japanese and for which only the CR is accessible. These are: (1) “internal” localization, i.e. adaptation of the CE to L2, giving CE-App-L2; (2) “external” localization, i.e. adaptation of an existing CE for L2 to the domain and to the App content representation (CE-X-L2-App); (3) translation of utterances from L2 to L1. The choice of strategy is constrained by the translational situation: type and level of possible access (complete access to the source code, access limited to the internal representation, access limited to the dictionary, or no access), available resources (dictionaries, corpora), and the language and linguistic competences of the people taking part in the multilinguization of the application. The three methods gave good results on the Arabic-to-French porting of the CARS part of CATS, which handles second-hand car ads. For internal localization, the grammatical part was modified very little, which shows that, despite the great distance between Arabic and French, these two sublanguages are very close to one another. This is a new illustration of R. Kittredge's analysis. External localization was tested on CATS and on IMRS by adapting to the new domain the French content extractor initially written by H. Blanchon for the tourism domain (CSTAR/Nespole! project), and then by changing the language for IMRS (English). Finally, porting by statistical MT also performed very well, with a very small training corpus (fewer than 10,000 words) and a complete dictionary. This shows that, for very small sublanguages, statistical MT may be of sufficient quality starting from a corpus 100 to 500 times smaller than for general language.

    Localizing Content Management Application for Spontaneous Textual Utterances in Natural Language.

    The multilinguization of content management services is an important but difficult problem, and very few services do it. It depends on the translation situation: types and level of possible access, available resources, and the linguistic competences of the participants in the multilinguization of the application. Several multilinguization strategies are then possible (by translation, by internal or external localization, etc.). We illustrate this study with a real case of linguistic porting (Arabic to French) of an e-commerce application deployed in Jordan, which processes spontaneous SMS texts for buying and selling second-hand cars. Despite the great distance between Arabic and French, the localization methods used give good results because of the proximity of the two sublanguages.

    Multilinguïsation des systèmes traitant des sous-langages

    This article focuses on our work on the multilinguization, or “linguistic porting”, of content management services that handle spontaneous natural-language utterances. Within this framework, we identified three possible methods for porting from a language L1 to a new language L2, and applied them to e-commerce systems. Porting by statistical translation, one of these three methods, performed very well with a very small training corpus (fewer than 10,000 words). This shows that, for very small sublanguages, statistical translation may be of sufficient quality when starting from a corpus 100 to 500 times smaller than for general language.

    Multilinguïsation de services de gestion de contenu.

    The multilinguization of content management services is uncommon. It is an important and difficult problem. It depends on the translational situation, which comprises a set of key factors: types and level of possible access, available resources, and the language and linguistic competences of those involved in the multilinguization of the applications... Several multilinguization strategies are then possible (by internal or external adaptation, by translation...). We illustrate this study with a real case of linguistic porting (Arabic to French) of a deployed e-commerce application that processes spontaneous SMS texts about buying and selling second-hand cars. Despite the great distance between Arabic and French, the localization methods used work well because of the proximity of the two sublanguages.